Objective. The goal of this project is to implement k-means clustering, a technique often used in machine learning, in Python and to use it for data analysis. Along the way we write functions that make use of lists, sets, dictionaries, sorting, and graph data structures for computational problem solving and analysis.


Part 1. Spotify API Data

Metadata.

  • name: The name of the track.
  • album: The name of the album on which the track appears.
  • artist: The name of the artist who performed the track.
  • release_date: The date the album was first released.
  • length: The track length in milliseconds.
  • popularity: The popularity of the track. Values are between 0 and 100. The popularity is calculated by an algorithm based on the total number of plays the track has had and how recent those plays are.

Artists.

  • artist_pop: The popularity of the artist. The value will be between 0 and 100, with 100 being the most popular. The artist’s popularity is calculated from the popularity of all the artist’s tracks.
  • artist_genres: A list of the genres the artist is associated with.

Audio Features.

  • acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic.
  • danceability: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm, beat strength, and regularity.
  • energy: A measure from 0.0 to 1.0 that represents a perceptual measure of intensity and activity.
  • instrumentalness: Predicts whether a track contains no vocals. The closer the value is to 1.0, the more likely the track contains no vocal content.
  • liveness: Detects the presence of an audience in the recording. Higher values represent an increased probability that the track was performed live.
  • loudness: The overall loudness of a track in decibels (dB). Values are averaged across the entire track and typically range between -60 and 0 dB.
  • speechiness: Detects the presence of spoken words in a track. The more speech-like the recording, the closer to 1.0.
  • tempo: The overall estimated speed or pace of a track in beats per minute (BPM).
  • valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric).
  • key: The key the track is in. If no key was detected, the value is -1.
  • mode: The modality (major or minor) of a track. Major is represented by 1 and minor is 0.
  • time_signature: An estimated time signature (how many beats are in each measure), ranging from 3 to 7 to indicate time signatures of “3/4” to “7/4”.
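
The key and mode fields are stored as integers, so a small lookup makes them easier to read. The sketch below assumes the standard pitch-class numbering (0 = C, 1 = C#/Db, up to 11 = B), which is not spelled out in the list above; the helper name describe_key is hypothetical.

```python
# Pitch-class names in standard notation (assumed mapping: 0 = C ... 11 = B).
PITCH_CLASSES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]


def describe_key(key: int, mode: int) -> str:
    """Return a readable label such as 'G major' for a (key, mode) pair."""
    if key == -1:
        # -1 means no key was detected
        return "unknown key"
    if not 0 <= key <= 11:
        raise ValueError(f"key must be -1 or in 0..11, got {key}")
    quality = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {quality}"


print(describe_key(7, 1))   # G major
print(describe_key(-1, 0))  # unknown key
```

Under this mapping, a track with key 7 and mode 1 would read as G major.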

Get Playlist Data from API

First, we create a Client Credentials Flow manager, which is used in server-to-server authentication, by passing the necessary parameters to the SpotifyClientCredentials class. We provide a client id and client secret to the constructor of this authorization flow, which does not require user interaction.

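import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Set client id and client secret (placeholders; substitute your own credentials)
client_id = 'xxx'
client_secret = 'xxx'

# Spotify authentication using the Client Credentials flow
client_credentials_manager = SpotifyClientCredentials(client_id, client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)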
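
Hard-coding the client id and secret is convenient for a demonstration, but the values can also be read from the environment. The following is a minimal sketch, assuming the credentials have been exported as SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET (the variable names spotipy itself looks for); the helper load_spotify_credentials is hypothetical and not part of the original project.

```python
import os


def load_spotify_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables."""
    client_id = env.get("SPOTIPY_CLIENT_ID")
    client_secret = env.get("SPOTIPY_CLIENT_SECRET")
    if not client_id or not client_secret:
        # Fail early with a clear message instead of a later authentication error
        raise RuntimeError(
            "Set SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET before running."
        )
    return client_id, client_secret
```

The returned pair can then be passed to SpotifyClientCredentials in place of the 'xxx' placeholders.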

Now we can get the full details of the tracks of a playlist based on a playlist ID, URI, or URL. Choose a specific playlist to analyze by copying the URL from the Spotify Player interface. Using that link, the following code uses the playlist_tracks method to retrieve a list of IDs and corresponding artists for each track from the playlist.

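# Collect track IDs and artist URIs across all playlist links
track_ids, artist_ids = [], []
for link in playlist_links:
    # Keep the last path segment and drop any query string to get the playlist ID
    playlist_URI = link.split("/")[-1].split("?")[0]
    # Iterate over list of tracks in playlist
    for i in sp.playlist_tracks(playlist_URI)["items"]:
        track_ids.append(i['track']["id"])  # Extract song id
        artist_ids.append(i['track']["artists"][0]["uri"])  # Extract URI of the first listed artist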
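
A single call to playlist_tracks returns one page of results, so longer playlists need the remaining pages as well. The sketch below shows one way to extract the playlist ID from a URL and to walk every page; the helper names and the example URL are hypothetical, and sp is assumed to be an authenticated spotipy.Spotify client exposing playlist_tracks and next.

```python
from urllib.parse import urlparse


def playlist_id_from_link(link: str) -> str:
    """Return the last path segment of a playlist URL, ignoring any query string."""
    path = urlparse(link).path  # e.g. "/playlist/<id>"
    return path.rstrip("/").split("/")[-1]


def collect_ids(sp, playlist_id):
    """Gather track IDs and first-artist URIs from every page of a playlist."""
    track_ids, artist_ids = [], []
    page = sp.playlist_tracks(playlist_id)
    while page:
        for item in page["items"]:
            track = item.get("track")
            if not track or not track.get("id"):
                continue  # skip entries without a usable track ID
            track_ids.append(track["id"])
            artist_ids.append(track["artists"][0]["uri"])
        # Follow the "next" link until the last page has been read
        page = sp.next(page) if page.get("next") else None
    return track_ids, artist_ids


# Hypothetical link, used only to illustrate the parsing step
print(playlist_id_from_link("https://open.spotify.com/playlist/abc123XYZ?si=0000"))
```

Passing the client in as an argument keeps the function easy to exercise with a stand-in object that returns canned pages.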

Then, we write a function, playlist_features, that takes the playlist data from the API and gets the metadata, artist details, and audio features of each track. The following code loops through each track ID in the playlist and extracts the song information by calling that function. From there, we can create a dataframe by passing in the returned data and giving the column header names we want.

# Loop over track ids (playlist_ids is assumed to hold one entry per track, parallel to track_ids)
all_tracks = [playlist_features(track_ids[i], artist_ids[i], playlist_ids[i])
              for i in range(len(track_ids))]

First two rows of the resulting dataframe, transposed for readability:

column            row 0                                                     row 1
name              2 AM                                                      Golden Child
track_id          3g3RCV5ImXwzHpKwM2iunc                                    04QWC97Dvd9g0IEDoyUDBX
album             Pure Infinity                                             Lady Wrangler
artist            SwaVay                                                    Shaboozey
artist_id         29gIYsdyccGoUc6qgkZeTK                                    3y2cIKLjiOlp1Np37WiUdH
release_date      2019-05-24                                                2018-10-05
length            198577                                                    177773
popularity        55                                                        46
artist_pop        51                                                        56
artist_genres     ['atl hip hop', 'indie hip hop', 'underground hip hop']   ['pop rap']
acousticness      0.434                                                     0.362
danceability      0.783                                                     0.792
energy            0.341                                                     0.591
instrumentalness  9.85e-05                                                  1.90e-06
liveness          0.362                                                     0.360
loudness          -12.353                                                   -8.848
speechiness       0.0727                                                    0.2900
tempo             126.799                                                   151.029
valence           0.184                                                     0.365
key               7                                                         0
mode              1                                                         1
time_signature    4                                                         4
playlist          but my feet in bottega                                    but my feet in bottega

# Number of rows and columns
rows, cols = df.shape
print(f'Number of songs: {rows}')
## Number of songs: 100
print(f'Number of attributes per song: {cols}')
## Number of attributes per song: 23
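
The body of playlist_features is not shown in the text. A minimal sketch of what such a function might look like follows, assuming sp is an authenticated spotipy.Spotify client and using its track, artist, and audio_features calls. The column order is taken from the dataframe header; the extra sp argument and the step that strips the artist URI down to a bare ID are assumptions made for illustration, not the original implementation.

```python
# Column order taken from the dataframe header shown in the text (23 columns).
AUDIO_FEATURE_KEYS = [
    "acousticness", "danceability", "energy", "instrumentalness",
    "liveness", "loudness", "speechiness", "tempo", "valence",
    "key", "mode", "time_signature",
]
COLUMNS = [
    "name", "track_id", "album", "artist", "artist_id", "release_date",
    "length", "popularity", "artist_pop", "artist_genres",
    *AUDIO_FEATURE_KEYS, "playlist",
]


def playlist_features(track_id, artist_id, playlist, sp):
    """Return one row of values, ordered like COLUMNS, for a single track.

    `sp` is assumed to be an authenticated spotipy.Spotify client.
    """
    track = sp.track(track_id)      # track metadata
    artist = sp.artist(artist_id)   # artist popularity and genres
    # audio_features returns a list; an entry may be None if no features exist
    features = sp.audio_features([track_id])[0] or {}

    row = [
        track["name"],
        track_id,
        track["album"]["name"],
        track["artists"][0]["name"],
        artist_id.split(":")[-1],   # reduce a "spotify:artist:<id>" URI to the bare ID
        track["album"]["release_date"],
        track["duration_ms"],       # track length in milliseconds
        track["popularity"],
        artist["popularity"],
        artist["genres"],
    ]
    row += [features.get(key) for key in AUDIO_FEATURE_KEYS]
    row.append(playlist)
    return row
```

With pandas available, pd.DataFrame(all_tracks, columns=COLUMNS) would then give a dataframe with the 23 columns counted above.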

Part 3. Similar Artists Web Visual

First, we want to find the most frequently occurring artist in a given playlist. We use the value_counts method, which returns a Series of counts for each distinct (artist, artist_id) pair, sorted in descending order.

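# Count distinct (artist, artist_id) pairs, sorted by count in descending order
tallyArtists = df.value_counts(["artist", "artist_id"]).reset_index(name='counts')
# Rows 0 and 1 are tied at 9 tracks each in the output below; index 1 selects Post Malone
topArtist = tallyArtists['artist_id'][1]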
##         artist               artist_id  counts
## 0   Juice WRLD  4MCBfE4596Uoi2O4DtmEMz       9
## 1  Post Malone  246dkjvS1zLTtiykXe5h60       9
## 2    SAINt JHN  0H39MdGGX6dbnnQPt6NQkZ       4
## 3        Quavo  0VRj0yCOv2FXJNP47XQnx5       3
## 4   Young Thug  50co4Is1HCEo8bhOyUWKpn       3

References